Chapter 3: Storage and Retrieval
Data structures that power your DB
Hash indexes
- Writes: append key + value records to a log
- Index: in-memory hashmap of key -> byte offset in the log (see the sketch after this section)
- Used by Bitcask, the default storage engine in Riak; the hashmap is kept entirely in memory
- Good for workloads where the value of each key is updated frequently
- Log compaction:
  - Break the log into segment files
  - Keep only the most recent update for each key
  - Merging and compaction write their output to a new file, in the background
  - After compaction, switch reads over to the new merged segment
- Lookup: each segment has its own hash table. Check the most recent segment's hashmap first, then the next-older one, ...
- File format: a binary format is more space-efficient and faster to parse than CSV
- Deleting records: write a tombstone record; during merging, discard previous entries for that key
- Crash recovery: if the database is restarted, the in-memory hashmaps are lost. Snapshot each segment's hashmap to disk to speed up recovery
- Concurrency control: a single writer thread. Segments are append-only and immutable, so reads can be concurrent
- Limitations:
  - The hash table must fit in memory, so it is a poor fit for a very large number of keys; an on-disk hashmap performs badly (lots of random I/O)
  - Range queries are not efficient
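
A minimal sketch of the idea above: an append-only data file plus an in-memory dict of key -> byte offset, with tombstones for deletes. The class name, the comma-separated record format, and the tombstone marker are illustrative choices (not Bitcask's actual format), and it assumes keys and values contain no newlines; a real engine would use a length-prefixed binary format and multiple segments with compaction.

```python
class HashIndexedLog:
    """Append-only log with an in-memory hash index (single segment, no compaction)."""

    TOMBSTONE = b"__tombstone__"  # illustrative marker for deleted keys

    def __init__(self, path):
        self.path = path
        self.index = {}              # key (bytes) -> byte offset of its latest record
        open(path, "ab").close()     # create the file if it does not exist
        self._rebuild_index()        # crash recovery: rescan the log (or load a snapshot)

    def _rebuild_index(self):
        with open(self.path, "rb") as f:
            offset = 0
            for line in f:                       # one "key,value\n" record per line
                key, _, _ = line.partition(b",")
                self.index[key] = offset         # later records overwrite earlier ones
                offset += len(line)

    def put(self, key: bytes, value: bytes):
        with open(self.path, "ab") as f:
            offset = f.tell()                    # current end of file
            f.write(key + b"," + value + b"\n")  # append; never overwrite in place
        self.index[key] = offset                 # point the index at the newest record

    def get(self, key: bytes):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])              # one seek + one read per lookup
            _, _, value = f.readline().rstrip(b"\n").partition(b",")
        return None if value == self.TOMBSTONE else value

    def delete(self, key: bytes):
        self.put(key, self.TOMBSTONE)            # tombstone; dropped during compaction

db = HashIndexedLog("segment0.log")
db.put(b"42", b"{'name': 'San Francisco'}")
db.put(b"42", b"{'name': 'SF'}")   # the old value stays in the log until compaction
print(db.get(b"42"))               # b"{'name': 'SF'}"
```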
SSTables and LSM trees
- SSTables (Sorted String Tables) were introduced by Google's Bigtable paper
- Keys are sorted, and each key appears only once within each merged segment
  - Makes merging segments efficient (like mergesort)
- When a write comes in, add it to an in-memory balanced tree (the memtable)
- When the memtable gets too big, write it out to disk as an SSTable file
- Reads: first check the memtable, then the most recent on-disk segment, then older ones, etc. (see the sketch after this list)
- Run merging and compaction in the background
- Problem: on a crash, the most recent writes (still in the memtable) are lost. Fix: also append every write immediately to a separate log on disk
- An LSM-tree (log-structured merge-tree) is built out of SSTables
  - Keep a cascade of SSTables that are merged in the background
  - LSM storage engines are based on this principle of merging and compacting sorted files
- Performance optimization: lookups are slow when a key does not exist (must check the memtable and every segment); use a Bloom filter to skip most of that work
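
A toy sketch of that write/read path, under assumptions made up for illustration: a plain dict stands in for the memtable (a real engine uses a balanced tree or skip list), "segments" are in-memory sorted lists rather than files, and there is no WAL, tombstone handling, or Bloom filter. The names (`LSMTree`, `MEMTABLE_LIMIT`) are invented.

```python
import bisect

MEMTABLE_LIMIT = 4  # tiny threshold so the flush is easy to demonstrate

class LSMTree:
    """Toy LSM tree: an in-memory memtable plus a list of sorted, immutable segments."""

    def __init__(self):
        self.memtable = {}   # stands in for a balanced tree / skip list
        self.segments = []   # newest first; each segment is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        # a real engine would also append the write to an on-disk log for crash recovery
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # write the memtable out as a sorted segment (the SSTable)
        self.segments.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # 1. check the memtable
            return self.memtable[key]
        for segment in self.segments:          # 2. then segments, newest to oldest
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None                            # a Bloom filter would avoid most of this work

    def compact(self):
        # merge all on-disk segments, keeping only the most recent value for each key
        merged = {}
        for segment in reversed(self.segments):  # oldest first, so newer values win
            merged.update(segment)
        self.segments = [sorted(merged.items())]

db = LSMTree()
for i in range(10):
    db.put(f"key{i}", f"v{i}")
db.put("key3", "v3-updated")
print(db.get("key3"))   # "v3-updated", found in the memtable before any segment
db.compact()
```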
B-tree
- Compared to log-structured indexes, which write variable-size (megabyte-sized) segments: a B-tree breaks the database down into fixed-size 4 KB blocks/pages, which corresponds more closely to the underlying hardware
- Most databases can fit into a B-tree that is three or four levels deep (see the estimate after this list)
- Grow a B-tree by splitting a full page into two
- To make the database resilient to crashes: a write-ahead log (WAL, also called the redo log)
- Concurrency: protect the tree's data structures with latches (lightweight locks)
- Optimizations:
  - Use copy-on-write pages instead of a WAL
  - Abbreviating keys to save space in interior pages
  - Try to lay leaf pages out sequentially on disk - harder to maintain than in LSM-trees
  - Additional pointers, such as references to sibling pages
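
A back-of-the-envelope check of the "three or four levels" claim, using the book's example figures (4 KB pages, branching factor 500, counting the fan-out at each of four levels); the snippet is just that arithmetic.

```python
page_size = 4 * 1024       # 4 KB pages, in bytes
branching_factor = 500     # child references per page
levels = 4                 # levels of branching

leaf_pages = branching_factor ** levels          # 500**4 = 62.5 billion leaf pages
capacity_bytes = leaf_pages * page_size
print(capacity_bytes / 10**12, "TB")             # 256.0 TB
```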
Comparing B-trees and LSM-trees
- Rule of thumb: LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads - however, there are no conclusive benchmarks
- Pros of LSM-trees:
  - A B-tree index must write every piece of data at least twice (WAL + page) and writes an entire page at a time - write amplification (illustrated below)
  - LSM-trees can typically sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification
  - LSM-trees compress better - B-tree storage engines leave some disk space unused due to fragmentation
- Cons of LSM-trees:
  - The compaction process can sometimes interfere with the performance of ongoing reads and writes - impacts higher percentiles of response time
  - If misconfigured, compaction cannot keep up with the rate of incoming writes - unmerged segments on disk keep growing until you run out of disk space, and reads slow down because they need to check more segment files
- In a B-tree, each key exists in exactly one place - easier to offer strong locking and transactional semantics
- B-trees are common in relational databases; LSM-trees in many newer databases
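
A rough illustration of write amplification for the B-tree case; the row and WAL sizes are assumptions picked for the example, not measurements. Changing one small row forces a whole-page write plus a WAL record.

```python
row_size = 100            # bytes actually changed by the application (assumed)
page_size = 4 * 1024      # a B-tree rewrites the whole page containing the row
wal_record = row_size     # the same change is also appended to the write-ahead log

bytes_written_to_disk = page_size + wal_record
write_amplification = bytes_written_to_disk / row_size
print(f"write amplification ~ {write_amplification:.0f}x")   # ~42x for this example
```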
Other indexing structures
- Storing values within the index:
  - Heap file: the index points to a row stored elsewhere (in a heap file)
  - Clustered index: the row is stored directly within the index
  - In MySQL's InnoDB, the primary key is always a clustered index; secondary indexes refer to the primary key
  - Compromise: store some of a table's columns within the index - a covering index
- Multi-column indexes
  - Concatenated index: several fields combined into one key
  - Multi-dimensional indexes for spatial data
    - Option 1: translate a two-dimensional location into a single number using a space-filling curve, then use a regular index (see the sketch at the end of this section)
    - Option 2: R-trees, e.g. using PostgreSQL's Generalized Search Tree (GiST)
- Full-text search and fuzzy indexes
- In-memory databases:
  - The performance advantage of in-memory databases is not that they avoid reading from disk
  - It comes from avoiding the overhead of encoding in-memory data structures in a form that can be written to disk
  - They also offer data models that are difficult to implement with disk-based indexes, e.g. Redis
- Future: NVM (non-volatile memory)
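
A sketch of option 1 for two-dimensional data: interleave the bits of the two coordinates (a Z-order / Morton code, one common space-filling curve) so that nearby points tend to get nearby single-number keys, which an ordinary B-tree or LSM-tree can then index. The 16-bit width, function name, and grid values are arbitrary choices for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-order (Morton) code: weave the bits of x and y into one integer key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions come from y
    return z

# e.g. quantize latitude/longitude into 16-bit grid cells first, then index the
# resulting single number with a regular one-dimensional index
lat_cell, lng_cell = 23010, 41500   # made-up grid coordinates
print(interleave_bits(lat_cell, lng_cell))
```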